feat(trainer): Add environment variables argument to CustomTrainer #54

astefanutti · 2025-07-30T11:41:40Z

What this PR does / why we need it:

This enables users to pass environment variables to the custom trainer.

/assign @kubeflow/kubeflow-trainer-team @szaher @kramaranya @eoinfennessy @briangallagher

Checklist:

Docs included if any changes are user facing

coveralls · 2025-07-30T11:43:15Z

Pull Request Test Coverage Report for Build 16677180921

Warning: This coverage report may be inaccurate.

This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.

For more information on this, see Tracking coverage changes with pull request builds.
To avoid this issue with future PRs, see these Recommended CI Configurations.
For a quick fix, rebase this PR at GitHub. Your next report should be accurate.

Details

2 of 2 (100.0%) changed or added relevant lines in 1 file are covered.
No unchanged relevant lines lost coverage.
Overall coverage increased (+0.2%) to 65.679%

Totals
Change from base Build 16655049250:	0.2%
Covered Lines:	266
Relevant Lines:	405

💛 - Coveralls

eoinfennessy

LGTM!

google-oss-prow · 2025-07-31T09:31:08Z

@eoinfennessy: changing LGTM is restricted to collaborators

In response to this:

LGTM!

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

andreyvelich

Thanks for this @astefanutti!
I have a general question: when do you think we should recommend that users use env rather than func_args?
I can imagine that system-level env should be configured by Platform Admins.

andreyvelich · 2025-07-31T16:38:55Z

python/kubeflow/trainer/types/types.py

        pip_index_url (`Optional[str]`): The PyPI URL from which to install Python packages.
        num_nodes (`Optional[int]`): The number of nodes to use for training.
        resources_per_node (`Optional[Dict]`): The computing resources to allocate per node.
+        env (`Optional[Dict[str, str]]`): Environment variables to set in the training containers.


Let's be more explicit, since SDK users should not need to be aware of containers?

Suggested change

-        env (`Optional[Dict[str, str]]`): Environment variables to set in the training containers.
+        env (`Optional[Dict[str, str]]`): The environment variables to set in the training nodes.

andreyvelich · 2025-07-31T16:41:46Z

python/kubeflow/trainer/utils/utils.py

+    if trainer.env:
+        env_vars = []
+        for key, value in trainer.env.items():
+            env_vars.append(
+                models.IoK8sApiCoreV1EnvVar(
+                    name=key,
+                    value=value
+                )
+            )
+        trainer_crd.env = env_vars


Maybe something like this:

Suggested change

-    if trainer.env:
-        env_vars = []
-        for key, value in trainer.env.items():
-            env_vars.append(
-                models.IoK8sApiCoreV1EnvVar(
-                    name=key,
-                    value=value
-                )
-            )
-        trainer_crd.env = env_vars
+    if trainer.env:
+        trainer_crd.env = [
+            models.IoK8sApiCoreV1EnvVar(name=key, value=value)
+            for key, value in trainer.env.items()
+        ]
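
For readers following this suggestion, here is a minimal, self-contained sketch of the dict-to-list conversion it describes. `EnvVar` is a hypothetical stand-in for `models.IoK8sApiCoreV1EnvVar` so the snippet runs without the Kubeflow SDK installed, and the helper name `to_env_vars` is likewise an assumption, not something from the PR.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class EnvVar:
    """Hypothetical stand-in for models.IoK8sApiCoreV1EnvVar (a name/value pair)."""

    name: str
    value: str


def to_env_vars(env: Optional[Dict[str, str]]) -> Optional[List[EnvVar]]:
    """Convert a plain dict into a list of EnvVar objects.

    Returns None when env is None or empty, mirroring the `if trainer.env:` guard.
    """
    if not env:
        return None
    # One EnvVar per dict entry; dict insertion order is preserved.
    return [EnvVar(name=key, value=value) for key, value in env.items()]


env_vars = to_env_vars({"HF_TOKEN": "example-token", "NCCL_DEBUG": "INFO"})
```

The comprehension builds the same list as the explicit loop with `append`, so the change is stylistic rather than behavioral.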

kramaranya

Looks great, thank you @astefanutti!

kramaranya · 2025-07-31T23:41:15Z

python/kubeflow/trainer/api/trainer_client_test.py

    if add_built_in_trainer:
        train_job = add_built_in_trainer_to_job(train_job)
    if add_custom_trainer:


Suggested change

-    if add_built_in_trainer:
-        train_job = add_built_in_trainer_to_job(train_job)
-    if add_custom_trainer:
+    if add_built_in_trainer:
+        train_job = add_built_in_trainer_to_job(train_job)
+    elif add_custom_trainer:

kramaranya · 2025-07-31T23:42:24Z

python/kubeflow/trainer/utils/utils.py

+    if trainer.env:
+        env_vars = []
+        for key, value in trainer.env.items():
+            env_vars.append(
+                models.IoK8sApiCoreV1EnvVar(
+                    name=key,
+                    value=value
+                )
+            )
+        trainer_crd.env = env_vars


Signed-off-by: Antonin Stefanutti <antonin@stefanutti.fr>

astefanutti · 2025-08-01T11:03:25Z

@andreyvelich @kramaranya thanks a lot for the review.

I should have addressed all your comments.

kramaranya · 2025-08-01T11:05:39Z

Thank you for this, @astefanutti!
/lgtm

astefanutti · 2025-08-01T11:11:26Z

> Thanks for this @astefanutti! I have a general question: when do you think we should recommend that users use env rather than func_args? I can imagine that system-level env should be configured by Platform Admins.

I would say this is convenient and useful for AI practitioners to be able to pass environment variables like HF_TOKEN, PYTHON_*, NCCL_* ...

On the other hand, func_args is more about hyper-parameters and the training configuration, while environment variables are a universal mechanism for configuring runtime components, one that I'm sure most AI practitioners are already familiar with.

At least for custom trainers, I would be inclined to think users would prefer being able to pass these variables directly rather than asking platform admins to configure them on the training runtimes each time.
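
To illustrate the distinction drawn above, here is a hedged sketch in which hyper-parameters arrive as function arguments while runtime settings are read from the process environment. The function name `train_func`, its parameters, and the returned dictionary are illustrative assumptions; only the variable names `HF_TOKEN` and `NCCL_*` come from this discussion.

```python
import os
from typing import Any, Dict


def train_func(learning_rate: float = 1e-3, epochs: int = 1) -> Dict[str, Any]:
    """Hypothetical training entry point.

    Hyper-parameters are explicit arguments (the func_args side), while
    runtime configuration is looked up in the environment (the env side).
    """
    hf_token = os.environ.get("HF_TOKEN")
    nccl_debug = os.environ.get("NCCL_DEBUG", "unset")
    return {
        "learning_rate": learning_rate,
        "epochs": epochs,
        "has_hf_token": hf_token is not None,
        "nccl_debug": nccl_debug,
    }


# Simulate what an env mapping would provide inside the training process.
os.environ.update({"HF_TOKEN": "example-token", "NCCL_DEBUG": "INFO"})
config = train_func(learning_rate=3e-4, epochs=2)
```

With the SDK, the same values would presumably be supplied through the `env` and `func_args` arguments of the custom trainer; the exact signature should be checked against the SDK documentation.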

Electronic-Waste

@astefanutti Thanks for creating this! Overall LGTM!

/lgtm

Signed-off-by: Antonin Stefanutti <antonin@stefanutti.fr>

andreyvelich

Thanks @astefanutti!
/lgtm
/approve

google-oss-prow · 2025-08-01T14:16:31Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: andreyvelich

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [andreyvelich]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

google-oss-prow bot requested review from Electronic-Waste and andreyvelich July 30, 2025 11:41

google-oss-prow bot added the size/M label Jul 30, 2025

eoinfennessy approved these changes Jul 31, 2025

andreyvelich reviewed Jul 31, 2025

kramaranya reviewed Jul 31, 2025

astefanutti added 2 commits August 1, 2025 12:41

feat(trainer): Add environment variables argument to CustomTrainer

d792ebc

Signed-off-by: Antonin Stefanutti <antonin@stefanutti.fr>

Review feedback

440fcc2

Signed-off-by: Antonin Stefanutti <antonin@stefanutti.fr>

astefanutti force-pushed the pr-01 branch from 523bb5c to 440fcc2 August 1, 2025 10:42

google-oss-prow bot assigned kramaranya Aug 1, 2025

google-oss-prow bot added the lgtm label Aug 1, 2025

andreyvelich reviewed Aug 1, 2025

google-oss-prow bot removed the lgtm label Aug 1, 2025

Electronic-Waste reviewed Aug 1, 2025

google-oss-prow bot assigned Electronic-Waste Aug 1, 2025

google-oss-prow bot added the lgtm label Aug 1, 2025

astefanutti force-pushed the pr-01 branch from 627a5f8 to df42673 August 1, 2025 14:05

google-oss-prow bot removed the lgtm label Aug 1, 2025

Add env argument to get_custom_trainer

359cfd2

Signed-off-by: Antonin Stefanutti <antonin@stefanutti.fr>

astefanutti force-pushed the pr-01 branch from df42673 to 359cfd2 August 1, 2025 14:07

andreyvelich reviewed Aug 1, 2025

google-oss-prow bot assigned andreyvelich Aug 1, 2025

google-oss-prow bot added the lgtm label Aug 1, 2025

google-oss-prow bot added the approved label Aug 1, 2025

google-oss-prow bot merged commit cc4cbe0 into kubeflow:main Aug 1, 2025
8 checks passed

google-oss-prow bot added this to the v0.1 milestone Aug 1, 2025

6 participants